Find MDCs associated with Medicaid and/or Private Insurance payer types.
Analyzing MDC codes from all admissions in HCUP NY SID 2006-2012.
K-means clustering classifies MDCs into k groups such that MDCs within the same cluster are as similar as possible, and MDCs from different clusters are as dissimilar as possible. For our data, similarity is represented by the number of discharges/admissions from each payer type.
K-means defines clusters by trying to minimize the total within-cluster variation. The standard algorithm (Hartigan-Wong (1979)) defines the within-cluster variation as the sum of squared Euclidean distances between each MDC and its corresponding cluster centroid:
\[W(C_k) = \sum_{x_i \in C_k} (x_i - \mu_k)^2\]
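where \(x_i\) is an MDC assigned to cluster \(C_k\) and \(\mu_k\) is the centroid of \(C_k\), i.e. the mean of the MDCs assigned to it.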
The algorithm tries to minimize the total within-cluster variation:
\[Total.Within.SS = \sum_{k=1}^{K} W(C_k) = \sum_{k=1}^{K} \sum_{x_i \in C_k} (x_i - \mu_k)^2\]
where \(K\) is the total number of clusters.
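To make the two formulas concrete, here is a minimal sketch in plain Python that computes \(W(C_k)\) for each cluster and sums the results into the Total.Within.SS. The function names and the toy points are made up for illustration; they are not taken from the HCUP data. Each point stands in for one MDC described by two payer-type counts.

```python
def centroid_of(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def within_cluster_ss(points, centroid):
    """W(C_k): sum of squared Euclidean distances to the centroid."""
    return sum(
        sum((x - m) ** 2 for x, m in zip(point, centroid))
        for point in points
    )


def total_within_ss(clusters):
    """Total.Within.SS: W(C_k) summed over all clusters."""
    return sum(within_cluster_ss(c, centroid_of(c)) for c in clusters)


# Hypothetical toy data: (Medicaid count, private insurance count) per MDC.
clusters = [
    [(1.0, 2.0), (3.0, 4.0)],
    [(10.0, 10.0), (12.0, 14.0), (14.0, 12.0)],
]

for c in clusters:
    print(centroid_of(c), within_cluster_ss(c, centroid_of(c)))
print(total_within_ss(clusters))
```

For these toy clusters the centroids are (2, 3) and (12, 12), the within-cluster sums of squares are 4 and 16, and the total is 20.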
K-means algorithm can be summarized as:
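As an illustration, the sketch below implements the simple assign-then-update iteration usually attributed to Lloyd. This is an assumption made for readability: it is not the Hartigan-Wong variant named above, which updates clusters point by point. The function names, the `initial_centroids` parameter, and the toy points are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def kmeans(points, k, initial_centroids=None, max_iter=100, seed=0):
    """Lloyd-style k-means; returns (labels, centroids)."""
    # Step 1: choose k starting centroids (random data points by default).
    if initial_centroids is None:
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)

    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return labels, centroids


# Hypothetical toy data: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2, initial_centroids=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

With the explicit starting centroids shown, the first three points end up in cluster 0 and the last three in cluster 1. With random starts the result can depend on the seed, which is why k-means is often run from several starting configurations.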
We ran k-means clustering for \(k = 2, \dots, 15\) and visualized the resulting clusters for \(k = 2, \dots, 6\).
Recall that k-means defines clusters by minimizing the total within-cluster variation (Total.Within.SS). We can plot the Total.Within.SS against the number of clusters k to decide the optimal number of clusters.
As k increases, the Total.Within.SS approaches 0, so its smallest value is not by itself a useful criterion. Researchers generally use the “elbow method”: choose the value of k where the curve bends, the point beyond which adding more clusters yields diminishing returns in reducing the Total.Within.SS.
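The elbow is normally read off the plot by eye. As a rough numerical stand-in, the sketch below picks the k with the largest discrete second difference of the Total.Within.SS curve, that is, the point where the drop before k most exceeds the drop after it. This heuristic is an assumption for illustration, not the method used in this analysis, and the values in `wss` are invented rather than taken from the data. It assumes the values of k are evenly spaced.

```python
def elbow_by_second_difference(ks, wss):
    """Return the k at which the curve bends most sharply.

    The bend at position i is (drop before i) - (drop after i).
    The first and last k have no two-sided bend and are skipped.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        drop_before = wss[i - 1] - wss[i]
        drop_after = wss[i] - wss[i + 1]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical Total.Within.SS values for k = 2..7 (made up).
ks = [2, 3, 4, 5, 6, 7]
wss = [1000.0, 600.0, 250.0, 220.0, 200.0, 190.0]

print(elbow_by_second_difference(ks, wss))
```

For these invented values the bends at k = 3, 4, 5, 6 are 50, 320, 10, and 10, so the sketch selects k = 4. On real curves the bend is often less pronounced, and the plot should still be inspected.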